Skip to content

Conversation

@ForeverAngry
Copy link
Contributor

Closes #2649

Rationale for this change

Add support for using bloom filters in the read path of pyiceberg.

Are these changes tested?

Yes.

Are there any user-facing changes?

I dont think so.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ForeverAngry for working on this. Since this changes the specification, we have to go through an Iceberg improvement proposal to ensure that there is concensus across different implementations.
As part of the change process, my main question would be; what's the added value on top the bloom filters that are embedded in the Parquet files.

Comment on lines 293 to 299
NestedField(
field_id=146,
name="bloom_filter_bytes",
field_type=MapType(key_id=147, key_type=IntegerType(), value_id=148, value_type=BinaryType()),
required=False,
doc="Map of column id to bloom filter",
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot just add a field; this requires a spec change: https://iceberg.apache.org/contribute/#apache-iceberg-improvement-proposals

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko take a look now, i changed the spirit of the PR so that it:

  • doesn't modify the Iceberg specification
  • Doesn't change any existing behavior

Rather, this pr just provides the initial utilities needed to read bloom filters from Parquet files at the file level.

If merged, next steps would be integrating them into the read path.

@ForeverAngry
Copy link
Contributor Author

@Fokko I was kinda thinking that when I submitted it. But, it was done, and I figured id just send it to see if it sparked any interest.

That's good information though, sometimes I forget about the governance structures that exist for these projects.

@ForeverAngry
Copy link
Contributor Author

Thanks @ForeverAngry for working on this. Since this changes the specification, we have to go through an Iceberg improvement proposal to ensure that there is concensus across different implementations. As part of the change process, my main question would be; what's the added value on top the bloom filters that are embedded in the Parquet files.

I guess, to me, the main benefit would be the ability to do file-level pruning before opening any files.

As a result, this would also come with some secondary benefits like being able to do row group-level pruning within a Parquet file after opening it.

@ForeverAngry ForeverAngry force-pushed the feature/bloom-filter-read-support branch from ee5901f to d6a123b Compare November 7, 2025 17:54
Add utility functions to read and check bloom filters directly from
Parquet files using PyArrow, without requiring Iceberg spec changes.

- get_parquet_bloom_filter_for_column(): Extract bloom filter from Parquet row group
- bloom_filter_might_contain(): Check if value might be in bloom filter

This provides foundation for future bloom filter integration without
modifying the Iceberg manifest specification.
@ForeverAngry ForeverAngry force-pushed the feature/bloom-filter-read-support branch from d6a123b to 55e8516 Compare November 7, 2025 18:00
@ForeverAngry ForeverAngry requested a review from Fokko November 7, 2025 18:17
@Fokko
Copy link
Contributor

Fokko commented Nov 7, 2025

I guess, to me, the main benefit would be the ability to do file-level pruning before opening any files.

This is correct, and it would give some optimization. However, bloom filters need to be tuned based on the cardinallity of the column. The price to pay is that we encode more information in the manifest files.

As a result, this would also come with some secondary benefits like being able to do row group-level pruning within a Parquet file after opening it.

I would expect Pyarrow to do this automatically 🤔

@ForeverAngry
Copy link
Contributor Author

I guess, to me, the main benefit would be the ability to do file-level pruning before opening any files.

This is correct, and it would give some optimization. However, bloom filters need to be tuned based on the cardinallity of the column. The price to pay is that we encode more information in the manifest files.

As a result, this would also come with some secondary benefits like being able to do row group-level pruning within a Parquet file after opening it.

I would expect Pyarrow to do this automatically 🤔

Fair enough, if the "Juice isnt work the squeeze" as they say, then no big deal 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add read support for parquet bloom filters

2 participants